Jaccard Score (Jaccard Similarity / Intersection-over-Union)#

The Jaccard score measures similarity between two sets:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

In ML, you’ll often see the same idea as Intersection-over-Union (IoU) for binary masks.

Goals#

  • Build intuition for intersection vs union (and why true negatives don’t matter).

  • Derive the classification form: \(\displaystyle \frac{TP}{TP+FP+FN}\).

  • Implement Jaccard from scratch in NumPy (binary, multiclass, multilabel).

  • Use Plotly to visualize how thresholds and errors change the score.

  • Optimize a tiny logistic regression model with a differentiable soft Jaccard loss.

Quick import (scikit-learn)#

from sklearn.metrics import jaccard_score
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

versions = {
    'numpy': np.__version__,
    'plotly': __import__('plotly').__version__,
}
try:
    import sklearn

    versions['sklearn'] = sklearn.__version__
except Exception:
    versions['sklearn'] = None

versions
{'numpy': '1.26.2', 'plotly': '6.5.2', 'sklearn': '1.6.0'}

Prerequisites & notation#

  • Binary labels: \(y \in \{0,1\}^n\)

  • Predicted labels: \(\hat{y} \in \{0,1\}^n\)

  • Predicted probabilities: \(p \in [0,1]^n\)

  • Confusion counts: \(TP\), \(FP\), \(FN\), \(TN\)

We’ll interpret the “positive set” as the indices where a vector equals 1: \(A = \{ i : y_i = 1 \}\) and \(B = \{ i : \hat{y}_i = 1 \}\).
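As a quick sanity check, the set view and the binary-vector view give the same number. This snippet is a standalone sketch (the variable names are illustrative):

```python
import numpy as np

y = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# set view: indices where each vector equals 1
A = set(np.flatnonzero(y).tolist())
B = set(np.flatnonzero(y_hat).tolist())
j_sets = len(A & B) / len(A | B)

# vector view: elementwise AND over elementwise OR
j_vec = np.logical_and(y, y_hat).sum() / np.logical_or(y, y_hat).sum()

print(j_sets, float(j_vec))  # both equal 0.6
```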

1) Set intuition#

Think of two sets:

  • \(A\): the “true” items

  • \(B\): the “predicted” items

The Jaccard score is:

\[ J(A,B) = \frac{|A \cap B|}{|A \cup B|} \]
  • Numerator: what both agree on (overlap)

  • Denominator: everything that appears in either (coverage)

So Jaccard is high only when the overlap is large and the union isn’t bloated by extras.

A = {1, 2, 3, 5, 8}
B = {2, 3, 4, 8, 9}

intersection = A & B
union = A | B

jaccard = len(intersection) / len(union)

A, B, intersection, union, jaccard
({1, 2, 3, 5, 8},
 {2, 3, 4, 8, 9},
 {2, 3, 8},
 {1, 2, 3, 4, 5, 8, 9},
 0.42857142857142855)
universe = np.arange(0, 10)

A_mask = np.isin(universe, sorted(A))
B_mask = np.isin(universe, sorted(B))

# 0: neither, 1: A only, 2: B only, 3: both
cat = A_mask.astype(int) + 2 * B_mask.astype(int)

colorscale = [
    [0.00, '#ffffff'],
    [0.249999, '#ffffff'],  # neither
    [0.25, '#ff7f0e'],
    [0.499999, '#ff7f0e'],  # A only
    [0.50, '#1f77b4'],
    [0.749999, '#1f77b4'],  # B only
    [0.75, '#2ca02c'],
    [1.00, '#2ca02c'],  # both (intersection)
]

fig = go.Figure(
    data=go.Heatmap(
        z=cat[np.newaxis, :],
        x=universe,
        y=['elements'],
        colorscale=colorscale,
        zmin=-0.5,
        zmax=3.5,
        colorbar=dict(
            title='category',
            tickmode='array',
            tickvals=[0, 1, 2, 3],
            ticktext=['neither', 'A only', 'B only', 'A ∩ B'],
        ),
        hovertemplate='element=%{x}<br>category=%{z}<extra></extra>',
    )
)

fig.update_layout(
    title=f'Jaccard = |A ∩ B| / |A ∪ B| = {len(intersection)}/{len(union)} = {jaccard:.3f}',
    height=220,
    margin=dict(l=20, r=20, t=60, b=20),
)

fig.show()

2) Binary classification view (TP / FP / FN)#

For binary classification, focus on the positive class:

  • \(A = \{ i : y_i = 1 \}\) (the actual positives)

  • \(B = \{ i : \hat{y}_i = 1 \}\) (the predicted positives)

Then:

  • \(|A \cap B| = TP\)

  • \(|A \cup B| = TP + FP + FN\)

So the Jaccard score becomes:

\[ J = \frac{TP}{TP + FP + FN} \]

Notice what’s missing: true negatives \(TN\). If your dataset has tons of negatives, accuracy can look great while Jaccard stays low.

def confusion_counts_binary(y_true, y_pred):
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = np.logical_and(y_true, y_pred).sum()
    fp = np.logical_and(~y_true, y_pred).sum()
    fn = np.logical_and(y_true, ~y_pred).sum()
    tn = np.logical_and(~y_true, ~y_pred).sum()

    return int(tp), int(fp), int(fn), int(tn)


def jaccard_score_binary(y_true, y_pred, *, zero_division=0.0):
    tp, fp, fn, _ = confusion_counts_binary(y_true, y_pred)
    denom = tp + fp + fn
    if denom == 0:
        return float(zero_division)
    return tp / denom


def accuracy_score_binary(y_true, y_pred):
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    return (y_true == y_pred).mean()


# quick sanity check
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])

tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred)
(tp, fp, fn, tn), jaccard_score_binary(y_true, y_pred), accuracy_score_binary(y_true, y_pred)
((2, 1, 1, 2), 0.5, 0.6666666666666666)

2.1 IoU for segmentation (same formula)#

If \(y\) and \(\hat{y}\) are binary masks (pixels in/out of an object), then:

  • intersection = pixels correctly predicted as object

  • union = pixels that are object in either mask

So IoU = Jaccard on the set of “object pixels”.

h, w = 40, 40
yy, xx = np.mgrid[0:h, 0:w]


def circle_mask(*, cx, cy, r):
    return (xx - cx) ** 2 + (yy - cy) ** 2 <= r**2


true_mask = circle_mask(cx=14, cy=20, r=10)
pred_mask = circle_mask(cx=18, cy=20, r=10)

# 0: background, 1: true-only (FN), 2: pred-only (FP), 3: overlap (TP)
cat = true_mask.astype(int) + 2 * pred_mask.astype(int)
iou = jaccard_score_binary(true_mask.ravel(), pred_mask.ravel(), zero_division=1.0)

colorscale = [
    [0.00, '#ffffff'],
    [0.249999, '#ffffff'],
    [0.25, '#d62728'],
    [0.499999, '#d62728'],  # true-only (red)
    [0.50, '#1f77b4'],
    [0.749999, '#1f77b4'],  # pred-only (blue)
    [0.75, '#2ca02c'],
    [1.00, '#2ca02c'],  # overlap (green)
]

fig = go.Figure(
    data=go.Heatmap(
        z=cat,
        colorscale=colorscale,
        zmin=-0.5,
        zmax=3.5,
        showscale=True,
        colorbar=dict(
            title='pixel',
            tickmode='array',
            tickvals=[0, 1, 2, 3],
            ticktext=['background', 'true only (FN)', 'pred only (FP)', 'overlap (TP)'],
        ),
        hovertemplate='x=%{x}<br>y=%{y}<br>category=%{z}<extra></extra>',
    )
)

fig.update_layout(
    title=f'IoU (Jaccard) on a toy mask: {iou:.3f}',
    width=520,
    height=520,
    yaxis=dict(scaleanchor='x', autorange='reversed'),
    margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()

2.2 Why true negatives don’t matter#

Hold \(TP\), \(FP\), \(FN\) fixed and add more and more true negatives.

  • Accuracy goes up (because it counts \(TN\)).

  • Jaccard stays exactly the same (because it ignores \(TN\)).

tp, fp, fn = 10, 5, 5
y_true_core = np.array([1] * tp + [1] * fn + [0] * fp, dtype=int)
y_pred_core = np.array([1] * tp + [0] * fn + [1] * fp, dtype=int)

tn_sizes = np.arange(0, 2001, 100)
accs = []
jaccs = []

for tn in tn_sizes:
    y_true_full = np.concatenate([y_true_core, np.zeros(tn, dtype=int)])
    y_pred_full = np.concatenate([y_pred_core, np.zeros(tn, dtype=int)])

    accs.append(accuracy_score_binary(y_true_full, y_pred_full))
    jaccs.append(jaccard_score_binary(y_true_full, y_pred_full))

fig = go.Figure()
fig.add_trace(go.Scatter(x=tn_sizes, y=accs, mode='lines+markers', name='accuracy'))
fig.add_trace(go.Scatter(x=tn_sizes, y=jaccs, mode='lines+markers', name='jaccard'))

fig.update_layout(
    title=f'Add more TN with TP={tp}, FP={fp}, FN={fn}: Jaccard stays constant',
    xaxis_title='number of added true negatives (TN)',
    yaxis_title='score',
    yaxis=dict(range=[0, 1]),
)
fig.show()

3) How FP and FN move Jaccard#

For fixed \(TP\), Jaccard shrinks as you add either false positives or false negatives:

\[ J = \frac{TP}{TP + FP + FN} \]
TP = 10
FP_vals = np.arange(0, 31)
FN_vals = np.arange(0, 31)

Z = np.zeros((len(FN_vals), len(FP_vals)), dtype=float)
for i, fn in enumerate(FN_vals):
    for j, fp in enumerate(FP_vals):
        Z[i, j] = TP / (TP + fp + fn)

fig = px.imshow(
    Z,
    x=FP_vals,
    y=FN_vals,
    origin='lower',
    aspect='auto',
    labels={'x': 'FP', 'y': 'FN', 'color': 'Jaccard'},
    title=f'Jaccard for fixed TP={TP}',
)
fig.show()

4) Relationship to precision/recall/F1#

  • Precision: \(\displaystyle P = \frac{TP}{TP+FP}\)

  • Recall: \(\displaystyle R = \frac{TP}{TP+FN}\)

  • F1: \(\displaystyle F_1 = \frac{2TP}{2TP+FP+FN}\)

Jaccard uses the same ingredients but with a different denominator:

\[ J = \frac{TP}{TP+FP+FN} \]

A useful identity links Jaccard and F1:

\[ J = \frac{F_1}{2 - F_1} \quad\Longleftrightarrow\quad F_1 = \frac{2J}{1 + J} \]
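The identity is easy to verify numerically from raw counts; here is a small standalone check (the counts are arbitrary):

```python
tp, fp, fn = 7, 3, 2

j = tp / (tp + fp + fn)            # 7/12
f1 = 2 * tp / (2 * tp + fp + fn)   # 14/19

# identity: J = F1/(2 - F1) and F1 = 2J/(1 + J)
assert abs(j - f1 / (2 - f1)) < 1e-12
assert abs(f1 - 2 * j / (1 + j)) < 1e-12
```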
f1 = np.linspace(0, 1, 501)
j_from_f1 = f1 / (2 - f1)

fig = go.Figure()
fig.add_trace(go.Scatter(x=f1, y=j_from_f1, mode='lines', name='J = F1/(2-F1)'))
fig.update_layout(
    title='Mapping between F1 and Jaccard',
    xaxis_title='F1',
    yaxis_title='Jaccard',
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
)
fig.show()

5) Multilabel and multiclass#

Multilabel#

Each sample can have multiple positive labels. If \(y, \hat{y} \in \{0,1\}^{n\times L}\), you can compute Jaccard:

  • per-sample and average (samples)

  • per-label and average (macro)

  • globally over all entries (micro)

Multiclass#

With mutually exclusive classes, a common definition is to compute a one-vs-rest Jaccard score per class and then average. This matches how sklearn.metrics.jaccard_score generalizes Jaccard when average != 'binary'.

def _safe_divide(num, den, *, zero_division=0.0):
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)

    out = np.full_like(num, float(zero_division), dtype=float)
    mask = den != 0
    out[mask] = num[mask] / den[mask]
    return out


def jaccard_score_multilabel(y_true, y_pred, *, average='samples', zero_division=0.0):
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    if y_true.ndim != 2:
        raise ValueError('Expected y_true with shape (n_samples, n_labels)')
    if y_pred.shape != y_true.shape:
        raise ValueError('y_pred must have the same shape as y_true')

    if average == 'micro':
        inter = np.logical_and(y_true, y_pred).sum()
        uni = np.logical_or(y_true, y_pred).sum()
        return float(_safe_divide(inter, uni, zero_division=zero_division))

    if average in (None, 'none'):
        inter_l = np.logical_and(y_true, y_pred).sum(axis=0)
        uni_l = np.logical_or(y_true, y_pred).sum(axis=0)
        return _safe_divide(inter_l, uni_l, zero_division=zero_division)

    if average in ('macro', 'weighted'):
        inter_l = np.logical_and(y_true, y_pred).sum(axis=0)
        uni_l = np.logical_or(y_true, y_pred).sum(axis=0)
        label_scores = _safe_divide(inter_l, uni_l, zero_division=zero_division)
        if average == 'macro':
            return float(label_scores.mean())

        supports = y_true.sum(axis=0)
        if supports.sum() == 0:
            return float(zero_division)
        return float(np.average(label_scores, weights=supports))

    if average == 'samples':
        inter_s = np.logical_and(y_true, y_pred).sum(axis=1)
        uni_s = np.logical_or(y_true, y_pred).sum(axis=1)
        sample_scores = _safe_divide(inter_s, uni_s, zero_division=zero_division)
        return float(sample_scores.mean())

    raise ValueError("average must be one of {'samples','micro','macro','weighted',None}")


def jaccard_score_multiclass(y_true, y_pred, *, average='macro', labels=None, zero_division=0.0):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.ndim != 1 or y_pred.ndim != 1:
        raise ValueError('Expected 1D label arrays')
    if y_pred.shape != y_true.shape:
        raise ValueError('y_pred must have the same shape as y_true')
    if len(y_true) == 0:
        return float(zero_division)

    if labels is None:
        labels = np.unique(np.concatenate([y_true, y_pred]))

    scores = []
    supports = []
    for lab in labels:
        t = y_true == lab
        p = y_pred == lab
        tp = np.logical_and(t, p).sum()
        fp = np.logical_and(~t, p).sum()
        fn = np.logical_and(t, ~p).sum()
        denom = tp + fp + fn

        score = float(zero_division) if denom == 0 else float(tp / denom)
        scores.append(score)
        supports.append(t.sum())

    scores = np.asarray(scores, dtype=float)
    supports = np.asarray(supports, dtype=float)

    if average == 'macro':
        return float(scores.mean())
    if average == 'weighted':
        if supports.sum() == 0:
            return float(zero_division)
        return float(np.average(scores, weights=supports))
    if average == 'micro':
        correct = (y_true == y_pred).sum()
        union = 2 * len(y_true) - correct
        return float(zero_division) if union == 0 else float(correct / union)
    if average in (None, 'none'):
        return scores

    raise ValueError("average must be one of {'micro','macro','weighted',None}")


# examples
y_true_ml = np.array(
    [
        [1, 0, 1],
        [0, 1, 0],
        [1, 1, 0],
        [0, 0, 0],
    ],
    dtype=int,
)
y_pred_ml = np.array(
    [
        [1, 1, 1],
        [0, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
    ],
    dtype=int,
)

scores = {
    'samples': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='samples', zero_division=1.0),
    'micro': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='micro', zero_division=1.0),
    'macro': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='macro', zero_division=1.0),
    'weighted': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='weighted', zero_division=1.0),
    'per-label': jaccard_score_multilabel(y_true_ml, y_pred_ml, average=None, zero_division=1.0),
}

y_true_mc = np.array([0, 1, 2, 2, 1, 0])
y_pred_mc = np.array([0, 2, 2, 1, 1, 0])

scores, {
    'multiclass_macro': jaccard_score_multiclass(y_true_mc, y_pred_mc, average='macro'),
    'multiclass_micro': jaccard_score_multiclass(y_true_mc, y_pred_mc, average='micro'),
    'multiclass_per_class': jaccard_score_multiclass(y_true_mc, y_pred_mc, average=None),
}
({'samples': 0.7916666666666666,
  'micro': 0.6666666666666666,
  'macro': 0.7222222222222222,
  'weighted': 0.6666666666666666,
  'per-label': array([0.5   , 0.6667, 1.    ])},
 {'multiclass_macro': 0.5555555555555555,
  'multiclass_micro': 0.5,
  'multiclass_per_class': array([1.    , 0.3333, 0.3333])})
try:
    from sklearn.metrics import jaccard_score as sk_jaccard_score

    print('Binary (sklearn):', sk_jaccard_score(y_true, y_pred))
    print('Binary (ours):   ', jaccard_score_binary(y_true, y_pred))

    print('Multilabel macro (sklearn):', sk_jaccard_score(y_true_ml, y_pred_ml, average='macro', zero_division=1.0))
    print('Multilabel macro (ours):   ', jaccard_score_multilabel(y_true_ml, y_pred_ml, average='macro', zero_division=1.0))

    print('Multiclass macro (sklearn):', sk_jaccard_score(y_true_mc, y_pred_mc, average='macro'))
    print('Multiclass macro (ours):   ', jaccard_score_multiclass(y_true_mc, y_pred_mc, average='macro'))
except Exception as e:
    print('sklearn not available:', e)
Binary (sklearn): 0.5
Binary (ours):    0.5
Multilabel macro (sklearn): 0.7222222222222222
Multilabel macro (ours):    0.7222222222222222
Multiclass macro (sklearn): 0.5555555555555555
Multiclass macro (ours):    0.5555555555555555

6) Thresholding probabilities#

Jaccard is defined on sets / hard labels. If your model outputs probabilities \(p\), you typically choose a threshold \(t\) and set:

\[ \hat{y}_i = \mathbf{1}[p_i \ge t] \]

Different thresholds trade off \(FP\) vs \(FN\), so they can change Jaccard a lot.

n = 400
y_true_thr = rng.binomial(1, 0.15, size=n)

# simulate a "model score": positives tend to have higher logits
logits = rng.normal(loc=0.0, scale=1.0, size=n) + 1.5 * y_true_thr
p_thr = 1 / (1 + np.exp(-logits))

thresholds = np.linspace(0.0, 1.0, 201)
j_scores = np.array(
    [jaccard_score_binary(y_true_thr, (p_thr >= t).astype(int), zero_division=0.0) for t in thresholds]
)
best_idx = int(j_scores.argmax())
best_t = float(thresholds[best_idx])
best_j = float(j_scores[best_idx])

fig = px.line(
    x=thresholds,
    y=j_scores,
    labels={'x': 'threshold', 'y': 'Jaccard'},
    title=f'Jaccard vs threshold (best t≈{best_t:.2f}, J≈{best_j:.3f})',
)
fig.add_vline(x=best_t, line_dash='dash', line_color='black')
fig.update_layout(yaxis=dict(range=[0, 1]))
fig.show()

7) Using Jaccard in optimization: a soft Jaccard loss#

The “hard” Jaccard score uses discrete predictions, so it’s not differentiable w.r.t. model parameters.

A common trick (especially in segmentation) is to replace hard predictions with probabilities \(p\):

  • Soft intersection: \(I = \sum_i y_i p_i\)

  • Soft union: \(U = \sum_i y_i + \sum_i p_i - \sum_i y_i p_i\)

Soft Jaccard:

\[ J_{soft}(y,p) = \frac{I + \varepsilon}{U + \varepsilon} \]

Soft Jaccard loss:

\[ \mathcal{L}_{IoU}(y,p) = 1 - J_{soft}(y,p) \]

Gradient w.r.t. a probability \(p_i\):

\[ \frac{\partial J_{soft}}{\partial p_i} = \frac{y_i (U+\varepsilon) - (I+\varepsilon)(1-y_i)}{(U+\varepsilon)^2} \]

Then use the chain rule for logistic regression, where \(p_i = \sigma(x_i^\top w)\).
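Before trusting an analytic gradient like this, it is worth checking it against finite differences. Here is a minimal standalone check of the \(\partial J_{soft}/\partial p_i\) formula above (the soft_jaccard helper is local to this snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=20).astype(float)
p = rng.uniform(0.05, 0.95, size=20)
eps = 1e-12


def soft_jaccard(y, p):
    I = np.sum(y * p)
    U = np.sum(y) + np.sum(p) - I
    return (I + eps) / (U + eps)


# analytic gradient from the formula above
I = np.sum(y * p)
U = np.sum(y) + np.sum(p) - I
grad = (y * (U + eps) - (I + eps) * (1 - y)) / (U + eps) ** 2

# central finite differences, one coordinate at a time
h = 1e-6
fd = np.zeros_like(p)
for i in range(len(p)):
    p_plus, p_minus = p.copy(), p.copy()
    p_plus[i] += h
    p_minus[i] -= h
    fd[i] = (soft_jaccard(y, p_plus) - soft_jaccard(y, p_minus)) / (2 * h)

print(np.max(np.abs(grad - fd)))  # agrees to high precision
```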

# Synthetic 2D binary classification (imbalanced)
n0, n1 = 900, 100
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 1.0], size=(n0, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=[1.0, 1.0], size=(n1, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n0 + [1] * n1, dtype=int)

# shuffle
perm = rng.permutation(len(y))
X = X[perm]
y = y[perm]

# train/test split (pure NumPy)
test_size = 0.30
n_test = int(len(y) * test_size)

X_test = X[:n_test]
y_test = y[:n_test]
X_train = X[n_test:]
y_train = y[n_test:]

# standardize (fit on train)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-12
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma

# add bias column
Xb_train = np.c_[np.ones(len(y_train)), X_train_s]
Xb_test = np.c_[np.ones(len(y_test)), X_test_s]

fig = px.scatter(
    x=X_train_s[:, 0],
    y=X_train_s[:, 1],
    color=y_train.astype(str),
    title='Training data (standardized)',
    labels={'color': 'class'},
)
fig.show()

Xb_train.shape, Xb_test.shape, float(y_train.mean()), float(y_test.mean())
((700, 3), (300, 3), 0.10285714285714286, 0.09333333333333334)
def sigmoid(z):
    z = np.clip(z, -60, 60)
    return 1 / (1 + np.exp(-z))


def log_loss(y, p, *, eps=1e-12):
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def soft_jaccard_loss(y, p, *, eps=1e-12):
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    I = np.sum(y * p)
    U = np.sum(y) + np.sum(p) - I
    return float(1.0 - (I + eps) / (U + eps))


def soft_jaccard_grad_p(y, p, *, eps=1e-12):
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    I = np.sum(y * p)
    U = np.sum(y) + np.sum(p) - I
    Ieps = I + eps
    Ueps = U + eps
    dJdp = (y * Ueps - Ieps * (1 - y)) / (Ueps**2)
    return -dJdp


def fit_logreg_gd(Xb, y, *, loss='log', lr=0.1, n_iter=400, l2=0.0, record_every=5):
    y = np.asarray(y, dtype=float)
    w = np.zeros(Xb.shape[1], dtype=float)

    history = {'iter': [], 'loss': [], 'jaccard@0.5': []}

    for t in range(n_iter):
        z = Xb @ w
        p = sigmoid(z)

        if loss == 'log':
            L = log_loss(y, p) + 0.5 * l2 * np.sum(w[1:] ** 2)
            grad = Xb.T @ (p - y) / len(y)
            grad[1:] += l2 * w[1:]
        elif loss == 'soft_jaccard':
            L = soft_jaccard_loss(y, p) + 0.5 * l2 * np.sum(w[1:] ** 2)
            dLdp = soft_jaccard_grad_p(y, p)
            dLdz = dLdp * p * (1 - p)
            grad = Xb.T @ dLdz / len(y)
            grad[1:] += l2 * w[1:]
        else:
            raise ValueError("loss must be 'log' or 'soft_jaccard'")

        w -= lr * grad

        if (t % record_every) == 0 or t == (n_iter - 1):
            y_hat = (p >= 0.5).astype(int)
            j = jaccard_score_binary(y.astype(int), y_hat, zero_division=0.0)
            history['iter'].append(t)
            history['loss'].append(float(L))
            history['jaccard@0.5'].append(float(j))

    return w, history


def best_threshold_for_jaccard(y_true, p, thresholds):
    scores = np.array(
        [jaccard_score_binary(y_true, (p >= t).astype(int), zero_division=0.0) for t in thresholds], dtype=float
    )
    best_idx = int(scores.argmax())
    return float(thresholds[best_idx]), float(scores[best_idx]), scores
# Train two models:
# - standard logistic regression (log-loss)
# - logistic regression with a soft Jaccard loss

w_log, hist_log = fit_logreg_gd(
    Xb_train,
    y_train,
    loss='log',
    lr=0.2,
    n_iter=400,
    l2=0.01,
    record_every=5,
)
w_iou, hist_iou = fit_logreg_gd(
    Xb_train,
    y_train,
    loss='soft_jaccard',
    lr=1.0,
    n_iter=400,
    l2=0.01,
    record_every=5,
)

# Evaluate on test
p_test_log = sigmoid(Xb_test @ w_log)
p_test_iou = sigmoid(Xb_test @ w_iou)

j05_log = jaccard_score_binary(y_test, (p_test_log >= 0.5).astype(int), zero_division=0.0)
j05_iou = jaccard_score_binary(y_test, (p_test_iou >= 0.5).astype(int), zero_division=0.0)

ths = np.linspace(0.01, 0.99, 99)
best_t_log, best_j_log, curve_log = best_threshold_for_jaccard(y_test, p_test_log, ths)
best_t_iou, best_j_iou, curve_iou = best_threshold_for_jaccard(y_test, p_test_iou, ths)

summary = {
    'log_loss': {'J@0.5': j05_log, 'best_t': best_t_log, 'best_J': best_j_log},
    'soft_jaccard': {'J@0.5': j05_iou, 'best_t': best_t_iou, 'best_J': best_j_iou},
}
summary
{'log_loss': {'J@0.5': 0.6206896551724138,
  'best_t': 0.38,
  'best_J': 0.7666666666666667},
 'soft_jaccard': {'J@0.5': 0.175,
  'best_t': 0.51,
  'best_J': 0.2857142857142857}}
# Training curves (loss)
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_log['iter'], y=hist_log['loss'], mode='lines', name='log-loss (train)'))
fig.add_trace(go.Scatter(x=hist_iou['iter'], y=hist_iou['loss'], mode='lines', name='soft Jaccard loss (train)'))
fig.update_layout(title='Training loss curves (different scales)', xaxis_title='iteration', yaxis_title='loss')
fig.show()

# Training curves (Jaccard at threshold 0.5)
fig = go.Figure()
fig.add_trace(
    go.Scatter(x=hist_log['iter'], y=hist_log['jaccard@0.5'], mode='lines', name='log-loss model')
)
fig.add_trace(
    go.Scatter(x=hist_iou['iter'], y=hist_iou['jaccard@0.5'], mode='lines', name='soft Jaccard model')
)
fig.update_layout(
    title='Training: Jaccard@0.5 over iterations',
    xaxis_title='iteration',
    yaxis_title='Jaccard@0.5',
    yaxis=dict(range=[0, 1]),
)
fig.show()

# Threshold tuning on test: maximize Jaccard
fig = go.Figure()
fig.add_trace(go.Scatter(x=ths, y=curve_log, mode='lines', name='log-loss model'))
fig.add_trace(go.Scatter(x=ths, y=curve_iou, mode='lines', name='soft Jaccard model'))
fig.add_vline(x=best_t_log, line_dash='dash', line_color='black')
fig.add_vline(x=best_t_iou, line_dash='dash', line_color='gray')
fig.update_layout(
    title='Test: Jaccard vs threshold (vertical lines = best thresholds)',
    xaxis_title='threshold',
    yaxis_title='Jaccard',
    yaxis=dict(range=[0, 1]),
)
fig.show()

8) Pros, cons, and where Jaccard shines#

Pros#

  • Interpretable overlap measure in \([0,1]\).

  • Ignores true negatives, which is great when negatives are overwhelming (e.g. segmentation background).

  • Natural fit for sets, sparse binary features, multi-label problems.

  • Symmetric: \(J(A,B)=J(B,A)\).

Cons#

  • Because it ignores \(TN\), it can be misleading when correct negatives matter.

  • The “hard” metric is non-differentiable, so you usually optimize a surrogate.

  • For small objects in segmentation, a small boundary shift can drop IoU a lot.

  • For multiclass/multilabel, results depend heavily on the averaging choice (micro vs macro vs samples).
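The small-object sensitivity is easy to demonstrate: shift two square masks by one pixel and compare IoU for a small vs a large square. This is a toy sketch; the sizes and grid are arbitrary:

```python
import numpy as np


def iou(a, b):
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()


def shifted_square_iou(size, shift=1, grid=64):
    # two identical squares, the prediction shifted right by `shift` pixels
    m1 = np.zeros((grid, grid), dtype=bool)
    m2 = np.zeros((grid, grid), dtype=bool)
    m1[10:10 + size, 10:10 + size] = True
    m2[10:10 + size, 10 + shift:10 + shift + size] = True
    return float(iou(m1, m2))


print(shifted_square_iou(4))   # small object: 0.6
print(shifted_square_iou(32))  # large object: ~0.939
```

The same one-pixel shift costs the small object far more IoU than the large one.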

Good use cases#

  • Image segmentation / detection masks (IoU)

  • Multi-label classification (tags)

  • Information retrieval and matching (set overlap)

9) Pitfalls & diagnostics#

  • Union = 0 edge case: if both sets are empty, Jaccard is undefined (\(0/0\)). Decide a convention (zero_division).

  • Threshold choice: Jaccard can change a lot with the threshold; tune it on a validation set.

  • Averaging:

    • micro favors frequent labels/classes

    • macro treats each label/class equally (often better for rare labels)

    • samples answers: “how good are we per example?” (multilabel)

  • Compare with precision/recall to see whether low IoU comes from extra positives (FP) or missed positives (FN).
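To make the union = 0 edge case concrete, here is a tiny standalone sketch mirroring the zero_division convention used throughout this notebook:

```python
import numpy as np


def jaccard_with_convention(y_true, y_pred, zero_division=0.0):
    inter = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    # both sets empty -> 0/0; return whatever convention was chosen
    return float(zero_division) if union == 0 else float(inter / union)


empty = np.zeros(5, dtype=int)
print(jaccard_with_convention(empty, empty, zero_division=0.0))  # 0.0
print(jaccard_with_convention(empty, empty, zero_division=1.0))  # 1.0
```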

10) Exercises#

  1. Create two predictions with the same accuracy but very different Jaccard. Explain the difference using \(TP/FP/FN/TN\).

  2. For multilabel data, build a case where micro is high but macro is low. What does that imply?

  3. Implement a soft Dice (F1) loss and compare its behavior to soft Jaccard on the same toy dataset.

References#

  • scikit-learn jaccard_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html

  • IoU/Jaccard loss in segmentation (overview): https://en.wikipedia.org/wiki/Jaccard_index